========================================================

Abstract:

The Wine Quality data, a publicly available dataset, was created from red and white wine samples. The two datasets relate to the red and white variants of the Portuguese “Vinho Verde” wine. For my Exploratory Data Analysis, I am considering the red wine dataset.

The objective here is to have an initial understanding of

  1. any relationship that may exist between the input variables / features and the output variable i.e. the quality of the red wine
  2. any relationship that may exist between the input variables themselves

Univariate Plots Section

To begin with, I would like to understand the following:

Structure of the data: study the data types, the dimensions of the data, and sample values:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Are there any NA values in the data frame?

## 
## FALSE 
## 20787
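A count like the one above can be produced by tabulating `is.na()` over the whole data frame; 20787 is simply 1599 rows times 13 columns. The following is a minimal sketch, not necessarily the exact call used here. The data frame name `rw_head` is hypothetical and holds only the first ten rows of three columns, copied from the `str()` output above.

```r
# Hypothetical sample: first ten rows of three columns from the str() output.
rw_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# is.na() returns a logical matrix; table() counts its TRUE/FALSE cells.
na_counts <- table(is.na(rw_head))
print(na_counts)

# With no missing values, the FALSE count equals rows x columns.
expected_cells <- nrow(rw_head) * ncol(rw_head)
print(expected_cells)
```

If any cell were missing, a `TRUE` column would appear in the table as well.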

Summary of the data:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Univariate plots

Histogram of Wine Quality:

The above chart helps to visually explore the distribution of wine quality in the sample. The majority of data points have a quality of 5 or 6. More importantly, the sample only covers quality scores from 3 to 8. The absence of wines at the highest and lowest ends of the quality scale could potentially be vital, and it has to be considered when drawing any final conclusion on the relationships between the variables.

Quality of Wine samples - tabled:

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The above table gives us a preliminary understanding of the distribution of the data. Now, I would like to see if there exists any relationship between the variables that might be of interest in forming an initial hypothesis.

With a fair understanding of the output variable, i.e. the quality of the wine, I will now attempt to understand the rest of the variables that might potentially contribute to the wine quality.

Boxplot of the variables:

Histogram of the variables:

Histogram of the variables with log transformation:

Mean, Median and Mode values of the variables:

## # A tibble: 12 x 4
##    key                  mean_value median_value mode_value
##    <chr>                     <dbl>        <dbl>      <dbl>
##  1 alcohol                 10.4          10.2        9.5  
##  2 chlorides                0.0875        0.079      0.08 
##  3 citric.acid              0.271         0.26       0    
##  4 density                  0.997         0.997      0.997
##  5 fixed.acidity            8.32          7.9        7.2  
##  6 free.sulfur.dioxide     15.9          14          6    
##  7 pH                       3.31          3.31       3.3  
##  8 quality                  5.64          6          5    
##  9 residual.sugar           2.54          2.2        2    
## 10 sulphates                0.658         0.62       0.6  
## 11 total.sulfur.dioxide    46.5          38         28    
## 12 volatile.acidity         0.528         0.52       0.6
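R has no built-in function for the statistical mode, so a table like the one above needs a small helper. The sketch below is one way to build such a summary in base R; the helper name `stat_mode` and the data frame `rw_head` (first ten rows of three columns, copied from the `str()` output) are hypothetical and not taken from the original analysis.

```r
# Hypothetical helper: most frequent value of a vector (first one wins on ties).
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical sample: first ten rows of three columns from the str() output.
rw_head <- data.frame(
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality     = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# One row per variable, mirroring the key / mean / median / mode layout.
rw_summary <- data.frame(
  key          = names(rw_head),
  mean_value   = sapply(rw_head, mean),
  median_value = sapply(rw_head, median),
  mode_value   = sapply(rw_head, stat_mode),
  row.names    = NULL
)
print(rw_summary)
```

On only ten rows the numbers naturally differ from the full-sample table; the point is the shape of the computation.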

Preliminary Inference:

  • There were no NA values and also the data seems to be tidy
  • From the above visualizations and results, several input variables (density and pH, for example, where mean and median nearly coincide) appear roughly normally distributed, while others are clearly skewed
  • Quality is a categorical variable and it makes sense to convert it to a factor
  • It might also be helpful to group the data into buckets based on quality grade rather than the actual score. The properties of wines rated 3 and 4 may not differ drastically, so it may be fruitful to cluster the quality scores to study any commonality
  • 132 samples of red wine have a citric acid value of 0. It will be interesting to study if this has any bearing on the quality of the red wine
  • Chlorides, fixed.acidity, residual.sugar, free.sulfur.dioxide, sulphates and total.sulfur.dioxide are positively skewed and have long right tails
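As a rough numeric cross-check of the skewness noted above, one can compare the mean and the median of each variable: a mean noticeably above the median hints at a right tail. The sketch below uses values copied from the mean/median table earlier in this section; the column name `right_skew_hint` is hypothetical, and this is only a heuristic, not a formal skewness test.

```r
# Mean and median values copied from the summary table above.
summary_tbl <- data.frame(
  key = c("chlorides", "fixed.acidity", "residual.sugar",
          "free.sulfur.dioxide", "sulphates", "total.sulfur.dioxide",
          "quality"),
  mean_value   = c(0.0875, 8.32, 2.54, 15.9, 0.658, 46.5, 5.64),
  median_value = c(0.079, 7.9, 2.2, 14, 0.62, 38, 6)
)

# Heuristic flag: mean above median suggests a longer right tail.
summary_tbl$right_skew_hint <- summary_tbl$mean_value > summary_tbl$median_value
print(summary_tbl)
```

The six input variables listed as positively skewed are flagged, while quality (mean below median) is not.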

Count of observations with Citric acid as 0:

## [1] 132
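A count like this can be obtained by summing a logical comparison. The sketch below is illustrative only: the vector `citric_acid` is a hypothetical name and contains just the first ten values shown in the `str()` output, so it does not reproduce the full-sample count of 132 (roughly 8% of the 1599 samples).

```r
# First ten citric.acid values from the str() output (hypothetical vector name).
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# TRUE counts as 1, so sum() gives the number of zeros and mean() their share.
zero_count <- sum(citric_acid == 0)
zero_share <- mean(citric_acid == 0)

print(zero_count)
print(zero_share)
```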

Create a quality grade variable (qgrade) & convert the quality into factor:

## [1] "Quality vs Grade - Counts of Sample"
##    
##     low medium high
##   3  10      0    0
##   4  53      0    0
##   5   0    681    0
##   6   0    638    0
##   7   0      0  199
##   8   0      0   18

New factor variable “quality.ordered” is created, with below levels:

## [1] "3" "4" "5" "6" "7" "8"
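The grouping shown in the cross-tabulation (3 and 4 as low, 5 and 6 as medium, 7 and 8 as high) can be sketched with `cut()`, and the ordered factor with `factor()`. The break points below are an assumption inferred from that table, and the short `quality` vector is made-up illustration data rather than the real sample.

```r
# Illustrative scores covering every observed quality level (not real data).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 5L, 6L)

# Assumed bins: (2,4] -> low, (4,6] -> medium, (6,8] -> high.
qgrade <- cut(quality,
              breaks = c(2, 4, 6, 8),
              labels = c("low", "medium", "high"))

# Ordered factor so that 3 < 4 < ... < 8 is preserved in plots and comparisons.
quality.ordered <- factor(quality, levels = 3:8, ordered = TRUE)

print(table(quality, qgrade))
print(levels(quality.ordered))
```

Keeping both variables allows plots to be coloured by coarse grade while still faceting or grouping by the individual score.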

Univariate Analysis Inference:

What is the structure of your dataset?

The dataset has 1599 observations across 12 variables plus 1 key variable (X). There are no NA values in the data frame. The quality of the red wine is the variable of interest here.

What is/are the main feature(s) of interest in your dataset?

The main feature is the quality of the red wine. The purpose is to study whether the other features have any influence on the quality of the wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I am interested in alcohol, chlorides, citric.acid, fixed.acidity and residual.sugar, as the variation in these features leads me to believe that some of it might explain the differences in quality.

Did you create any new variables from existing variables in the dataset?

There are 2 changes that I am implementing:
  • creating a new factor variable called quality.ordered from the quality variable, as it is a categorical value
  • creating a new variable called qgrade by cutting the quality variable into grades to aid further analysis

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Chlorides, fixed.acidity, residual.sugar, free.sulfur.dioxide, sulphates and total.sulfur.dioxide are positively skewed and have long right tails.

I created a data frame called rw_long in long format for easier plotting. I also created a data frame called rw_long_by_keys that is grouped by feature, for easier eyeballing of the mean, median and mode of each feature. Lastly, I created a data frame called rw_corr that holds just the numerical variables, in order to compute the correlation coefficients on them.

Bivariate Plots Section

To begin with, I would like to understand if there is any relationship among the variables in terms of correlation. I plan to use ggpairs as a rough cut and refine further using a correlation plot.

ggpair plot:

Corr Plot:

The correlation plot above provides some insight into the relationships.

  1. Alcohol, sulphates and citric.acid are the top 3 features in terms of positive correlation with quality
  2. Volatile.acidity and quality are negatively correlated
  3. Fixed.acidity and citric.acid are negatively correlated with pH, as expected
  4. Volatile.acidity and pH interestingly show a positive correlation. Perhaps the volatility dilutes the concentration of acid in the wine?
  5. Fixed.acidity, citric.acid and residual.sugar are each positively correlated with the density of the wine
  6. Alcohol is negatively correlated with density
quality Vs alcohol:

Summary of alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

From the above plot, as the quality of the wine increases, i.e. from 5 to 8, the alcohol content seems to increase, as indicated by the median of each group. It is also prudent to note that within each quality bucket there is variation in the alcohol content.

quality Vs sulphates:

Summary of sulphates:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The above plot reinforces the positive correlation between sulphates and quality, and we see higher sulphate content in the better quality wines. Having said that, the sulphate distribution is long tailed and, as can be seen above, the mean is greater than the median across the quality bins.

quality Vs citric.acid:

Summary of citric acid:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Variation in citric acid does seem to go along with variation in the quality of the red wine, as better quality wines seem to have a higher median citric.acid than lower quality wines across the spectrum.

quality Vs volatile.acidity:

Summary of volatile.acidity:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

## [1] "Summary of volatile.acidity with quality 7"
##  volatile.acidity
##  Min.   :0.1200  
##  1st Qu.:0.3000  
##  Median :0.3700  
##  Mean   :0.4039  
##  3rd Qu.:0.4850  
##  Max.   :0.9150
## [1] "Summary of volatile.acidity with quality 8"
##  volatile.acidity
##  Min.   :0.2600  
##  1st Qu.:0.3350  
##  Median :0.3700  
##  Mean   :0.4233  
##  3rd Qu.:0.4725  
##  Max.   :0.8500

The above chart indicates the negative correlation between volatile.acidity and quality: higher quality wines have lower volatile.acidity content. Note that the median volatile.acidity is 0.3700 for red wine samples with a quality rating of both 7 and 8.
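Per-quality summaries like the ones above can be sketched with `tapply()` for a single statistic and `summary()` on a subset for the full five-number view. The data frame `rw_head` below is a hypothetical name holding only the first ten rows from the `str()` output, so it contains no wines of quality 8 and does not reproduce the medians reported above.

```r
# Hypothetical sample: first ten rows of two columns from the str() output.
rw_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median volatile.acidity within each quality level.
va_median <- tapply(rw_head$volatile.acidity, rw_head$quality, median)
print(va_median)

# Full summary for a single quality level, here quality 7.
va_q7 <- summary(rw_head$volatile.acidity[rw_head$quality == 7])
print(va_q7)
```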

quality Vs pH:

It is very interesting to note that pH and quality are slightly negatively correlated, giving the impression that a higher acidity level leads to better wine. Is this really the case? Row 3 of the grid gives this information: citric.acid and fixed.acidity are positively correlated with quality (even though fixed.acidity is only weakly correlated). However, volatile.acidity (VA) is bucking the trend. While one can understand how higher VA can be associated with lower quality, it is intriguing that higher VA does not result in lower pH. A little reading helped here. Apparently VA refers to the acidic elements of a wine that are gaseous rather than liquid, and can therefore be sensed as a smell. Reference: (https://www.decanter.com/learn/volatile-acidity-va-45532/#cbyaQ5UHej7Z1m1D.99)

quality Vs density:

From the above plots, a) residual.sugar has no strong bearing on the quality of wine; b) density is weakly correlated with quality and seems to have a negative correlation with it, which can be seen as we move from quality grade 5 to 8; c) density has a strong positive correlation with fixed.acidity, citric.acid and residual.sugar. Note: geom_smooth() is using method = ‘gam’ and formula ‘y ~ s(x, bs = “cs”)’ for density vs citric acid.

Bivariate Analysis Inference:

  • A positive correlation was seen between quality and alcohol, sulphates and citric.acid
  • A negative correlation was observed between quality and volatile.acidity
  • Volatile.acidity (VA) and pH were observed to have a positive correlation. Upon further reading, it looks like VA refers to the acidic elements of a wine that are gaseous rather than liquid, and this explains why an increase in VA content did not lead to higher acidity in the red wine
  • Residual.sugar did not seem to influence the red wine quality grade
  • Fixed.acidity, citric.acid and residual.sugar have a positive correlation with density

Multivariate Plots Section

Here I analyze 2 or more features together in order to dig further and clarify the understanding from the prior sections. It will be interesting to see if introducing a third variable reveals something interesting or unexpected.

alcohol Vs sulphates sliced by quality & quality grade:

From the correlation plot and the bivariate analysis, we understood that sulphates and alcohol have a positive correlation with the quality of the red wine. The above chart illustrates this, as we see blue datapoints layered above yellow datapoints, which are in turn layered above the red datapoints, indicating that the content of sulphates and alcohol seems to influence the quality of the wine. The bottom chart provides further insight into the subtrend within each quality grade. There are outliers here, of course.

alcohol Vs citric.acid sliced by quality & quality grade:

In the top chart, there are some yellow datapoints above the blue points, and there is a red datapoint at around 1 (g / dm^3) of citric acid, which implies that higher citric acid does not necessarily mean better quality and that there may be other factors that influence the quality. Additionally, the alcohol/citric.acid analysis (bottom chart) indicates that it is alcohol that is dominant in influencing the quality, as explained below:

  • Within the low grade, wine samples of quality 3 and 4 are seen at roughly the same citric.acid content
  • Within the medium grade, there are clusters along the x-axis, indicating that varying levels of alcohol delineate quality 5 and 6
  • Within the high grade, green datapoints (quality of 7) are spread along the y-axis, and blue datapoints (quality of 8) are too scattered across y values to show any meaningful correlation

The clusters as along x-axis as evident from the media

sulphates Vs citric.acid sliced by quality & quality grade:

The top chart shows that the variation in citric acid along with the variation in sulphates does have some influence on the wine quality between the grades. This can be seen from the clusters of green, light green and pink datapoints.

The bottom chart is very interesting. Here between the low grade red wine it appears that citric.acid does play a part. However within the mid and high grade wine sulphates content takes over and citric.acid has a weak correlation.

alcohol Vs volatile.acidity sliced by quality & quality grade:

The above chart reveals an interesting insight. We know from the bivariate analysis that alcohol has a positive correlation with quality and that volatile.acidity has a negative correlation with quality. When alcohol and volatile.acidity are studied against each other, the trend is as expected for the most part: we see high quality wine datapoints in the lower right quadrant of higher alcohol and lower volatile.acidity.

But looking at the last chart, i.e. VA vs alcohol among the high grade wines, we see several datapoints with a wine quality of 8 and a high content of volatile.acidity. This trend was not apparent in the bivariate analysis.

quality.ordered Vs density sliced by quality & quality grade:

In the top chart, there seem to be clusters along the y-axis where high quality wines have lower volatile.acidity. This also implies that density is weakly correlated with quality.

The bottom chart gives an interesting sub trend. Studying the VA vs density variation between wine quality of 7 & 8, we see datapoints with higher volatile.acidity pertaining to wine quality of 8. I was expecting to see them associated with quality of 7.

This variation in the trend could be because of one of the following factors:
  • there is a fine optimal balance of density and volatile.acidity which yields a quality of 8
  • there are other variables that play a part along with volatile.acidity when it comes to the high grade wines
  • the small number of samples with wine quality 8 leads to the difference in pattern

Multivariate Analysis Inference:

  • Alcohol, followed by sulphates, influences red wine quality the strongest, and both are positively correlated with the wine quality
  • Upon analyzing citric.acid vs alcohol, citric.acid has a weak correlation with quality, and the analysis showed that higher citric acid does not necessarily mean better quality and that there may be other factors that influence the quality
  • Upon analyzing citric.acid vs sulphates, citric.acid has some effect among low grade wines, but sulphates have a stronger correlation with quality in the mid and high grade wines
  • Based on the volatile.acidity vs alcohol analysis, the highest quality samples, i.e. samples of quality 8, showed a peculiar feature. They were observed to have higher volatile.acidity compared to quality 7 samples
  • Pitting volatile.acidity against alcohol, density has a weaker correlation with quality than volatile.acidity

Final Plots and Summary

Plot One

Description One

The above chart was helpful and provided a dashboard view for understanding quality vis-à-vis density, fixed acidity, citric acid and residual sugar, in addition to providing a view of density vs fixed acidity, citric acid and residual sugar. The takeaway from this is the inference that

  • density is strongly correlated to fixed acidity, citric acid & residual sugar
  • density by itself is weakly correlated to quality
  • residual sugar has negligible correlation to quality
  • citric acid and fixed acidity have a positive correlation with quality

Plot Two

Description Two

This plot clearly indicates the segmentation of the wine quality grades in relation to the two most significant properties, i.e. alcohol and sulphates. The high quality grade, indicated by blue dots, is seen in the upper right quadrant, followed by the medium quality grade, indicated by yellow, with the low quality grade at the bottom, indicated by the orange datapoints. To me, this showed the variation in the input properties, i.e. alcohol and sulphates, and their plausible effect on the quality variable.

The same pattern can be seen in the bottom chart, which breaks the quality down further to the granularity of individual scores.

This inference is again based on the given sample, and it is prudent to caution that correlation does not imply causation here.

Plot Three

Description Three

This chart gave very nice insight and provided the benefit of both aggregating and drilling down into the data. The top chart provides an aggregated view of alcohol and volatile.acidity by quality grade. It painted a nice picture of how fine quality wine had lower volatile.acidity content and higher alcohol as a rule of thumb. But breaking it down to the granular level of the quality score, I was surprised that the highest quality wine, of quality 8, had a higher volatile.acidity content, breaking the assumption I had built from the correlation plot and the univariate analysis.


Reflection

Being a teetotaller, I approached this analysis with no prior subject knowledge. Perhaps this is a good thing, as my only biased opinion was that residual.sugar must be influencing the wine quality. But it turned out not to be :)

When I started the exploration, the inherent relationships among the variables were not intuitive. The phased structure of the exploration, i.e. univariate followed by bivariate and multivariate, helped in making an initial hypothesis and subsequently either validating or refining the hypothesis about the data.

Studying multiple variables helped scratch the surface of multicollinearity and did throw up some unexpected results, as explained in the above section.

For my future work, I would like to study the following:
  1. Explore a suitable machine learning algorithm, as the dataset lends itself to a classification algorithm
  2. Apply KBest feature selection to validate whether I have missed any strong features
  3. Explore whether any new features could be engineered from the dataset and whether they would have a stronger correlation

Lastly, it would be nice to have a bigger sample size, particularly across all quality bins, to derive further insights.